Rework on ParquetDataset for easy access and better cache size in eager mode #384

Merged
merged 4 commits into tensorflow:master from yongtang:parquet
Aug 5, 2019

Conversation

@yongtang (Member) commented Jul 27, 2019

This fix is part of the effort to improve the overall Dataset for easier access and better cache size in eager mode. See #382 and #366 for related discussions.

In order to be able to read a file either from a filename or from memory, this PR adds a SizedRandomAccessFile which allows providing an optional memory buffer as the file content. This could be useful when processing compressed or archived files, where we could simply read the uncompressed file content into memory.

The previous limitation in Dataset was that a Dataset was an iterable, so the sequence length was unknown until graph runtime. In this PR, we provide a helper function to read the columns of a parquet file so that the length is known.

This also could open other avenues, such as mapping a parquet file with __getitem__ and __len__.
Further, a parquet file could be read into a Tensor and processed easily (e.g., with a pandas-like API).
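
As a rough illustration of that avenue, below is a minimal sketch of an indexable wrapper built on the ops described in this PR; the module path `tensorflow_io.parquet`, the call `read_parquet(filename, column)`, and the example file/column names are assumptions for illustration, not the final API.

```python
# Hypothetical sketch: an indexable, pandas-like view over one parquet column.
# The module path and the read_parquet signature are assumptions based on
# this PR's description (eager mode only).
import tensorflow_io.parquet as parquet  # assumed module path


class ParquetColumn:
  """Expose a parquet column through __getitem__ and __len__."""

  def __init__(self, filename, column):
    # Assumed: read_parquet(filename, column) returns the whole column
    # as a Tensor in eager mode.
    self._values = parquet.read_parquet(filename, column)

  def __len__(self):
    # The first dimension is the number of records in the column.
    return int(self._values.shape[0])

  def __getitem__(self, index):
    return self._values[index]


# Usage (eager mode), with a hypothetical file and column name:
# col = ParquetColumn("data.parquet", "price")
# print(len(col), col[0])
```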

The list_parquet_columns approach could be similarly applied to HDF5, where it is even more important: an HDF5 file could have datasets with different sizes.

Summary:

  1. Two basic C++ kernel ops are implemented: list_parquet_columns and read_parquet
  2. One ParquetDataset that is a Python-only implementation (no C++ anymore)
  3. ParquetDataset supports eager and graph mode. In graph mode, dtype and shape
     are provided by the user explicitly; in eager mode, only the column name is needed.
  4. read_parquet works in eager and graph mode, and can read records either in full or in slices
  5. list_parquet_columns works in eager mode only (a limitation). See the usage sketch after this list.
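
A rough usage sketch of the pieces summarized above; the module path, keyword names, and the example file/column names are assumptions for illustration, not the final API.

```python
# Rough usage sketch of the ops summarized above; module path and
# argument names are assumptions, not the final API.
import tensorflow as tf  # used only in the graph-mode example below
import tensorflow_io.parquet as parquet  # assumed module path

filename = "data.parquet"  # hypothetical file

# Eager mode: list_parquet_columns discovers column names, dtypes and shapes.
columns = parquet.list_parquet_columns(filename)
print(columns)

# Eager mode: only the column name is needed to build the dataset.
dataset = parquet.ParquetDataset(filename, ["price"])

# Graph mode: dtype and shape must be supplied by the user explicitly,
# since list_parquet_columns is eager-only and cannot discover them here.
# dataset = parquet.ParquetDataset(filename, ["price"],
#                                  dtype=[tf.float64], shape=[[1000]])

# read_parquet works in both modes and can read a column in full.
values = parquet.read_parquet(filename, "price")
```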

For the cache batch size vs. the batch size in tf.keras:

  1. Added a hidden `capacity` argument to adjust the cache batch size
  2. The batch size passed to tf.keras is unrelated to `capacity`, but we could use `rebatch`
     to change it at the end of the pipeline (see the sketch after this list).
  3. `capacity` could be padded to allow `rebatch` to only cut a slice within one chunk.
     If not padded to the `batch_size` used in tf.keras, then `rebatch` will likely copy across chunk boundaries.
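
To illustrate items 2 and 3: a rebatch at the end of the pipeline can be approximated with unbatch followed by batch. In the sketch below, the `capacity` keyword, the module path, and the file/column names are assumptions for illustration; it also assumes ParquetDataset behaves like a tf.data.Dataset that emits chunks of `capacity` records.

```python
# Sketch of decoupling the cache chunk size (`capacity`) from the batch
# size handed to tf.keras; the `capacity` keyword and module path are
# assumptions for illustration.
import tensorflow_io.parquet as parquet  # assumed module path

# Read in large chunks to keep the eager-mode cache efficient.
dataset = parquet.ParquetDataset("data.parquet", ["price"], capacity=4096)

# Rebatch at the end of the pipeline: unbatch() then batch() yields the
# batch size tf.keras expects, independent of `capacity`. If 4096 is a
# multiple of 32, each output batch is a slice of a single cached chunk
# and no copy across chunk boundaries is needed.
dataset = dataset.unbatch().batch(32)

# model.fit(dataset, ...)  # hypothetical tf.keras model
```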

Signed-off-by: Yong Tang yong.tang.github@outlook.com

@yongtang (Member, Author)

/cc @terrytangyuan @BryanCutler @feihugis

/cc @CaptainDuke in case you are interested. I am thinking about applying a similar enhancement to HDF5 as well.

@CaptainDuke (Contributor)

Many thanks to Yongtang.

Yes, actually the contents of HDF5 files do not need to be decoded. Also, I'm working on HDF5 files with different sizes. For example:

# h5ls test_data_level_6/10.hdf5 
atk_diff            Dataset {5120, 1}
emy_vec_5           Dataset {5120, 429}
frame                    Dataset {5120, 1}
global_info              Dataset {5120, 68}
hot_label                Dataset {5120, 1}
hot_weight               Dataset {5120, 1}
img_data                 Dataset {5120, 5, 31, 31}
...

I believe such an enhancement would be helpful.
BTW, is the bug in issue #342 related to this problem?

> The previous limitation in Dataset was that a Dataset was an iterable, so the sequence length was unknown until graph runtime. In this PR, we provide a helper function to read the specs of a parquet file so that the length is known.

@yongtang
Copy link
Member Author

@CaptainDuke the issue #342 you are referring to might not be directly related to this problem. However, the recent changes in upstream tf.data (tensorflow/tensorflow@c5c1839) might make things complicated, as we will likely need to update the API pretty soon. With the ongoing rework of the cache size and the tf.io pipeline interaction with tf.data, it might make sense to fix that together with this PR.

yongtang merged commit 1642da1 into tensorflow:master on Aug 5, 2019
yongtang deleted the parquet branch on August 5, 2019 21:04
i-ony pushed a commit to i-ony/io that referenced this pull request Feb 8, 2021
Rework on ParquetDataset for easy access and better cache size in eager mode (tensorflow#384)

* Rework on ParquetDataset for easy access and better cache size in eager mode

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Fix build failures

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Rename read_parquet_columns => list_parquet_columns

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

* Remove batch args, and add test in graph mode

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>